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The diffusion quantum Monte Carlo (QMC) method gives a stochastic 
solution to the Schrodinger equation. This approach has recently been 
receiving increasing attention in chemical applications as a result of 
its high accuracy. However, reducing statistical uncertainty remains a 
priority because chemical effects are often obtained as small differences 
of large numbers. We give as an example the singlet-triplet splitting of 
the energy of the methylene molecule CH^. 

We have implemented the QMC algorithm on the Cyber 205, first as a 
direct transcription of the algorithm running on our VAX 11/780, and 
second by explicitly writing vector code for all loops longer than a 
crossover length C*. We discuss the speed of the codes relative to one 
another as a function of C*, and relative to the VAX. Since CHg has 
only eight electrons, most of the loops in this application are fairly 
short. The longest inner loops run over the set of atomic basis 
functions. We discuss the CPU time dependence obtained versus the number 
of basis functions, and compare this with that obtained from traditional 
quantum chemistry codes and that obtained from traditional computer 
architectures. Finally, we discuss some preliminary work on restruc- 
turing the algorithm to compute the separate Monte Carlo realizations in 
parallel — potentially allowing vectors of unlimited length. 
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1. BACKGROUND 


In recent years Monte Carlo methods have been increasingly 
applied to quantum-mechanical problems. Quantum Monte Carlo (QMC) 
methods fall into two major categories. Variational QMC''’ is a 
method of evaluating expectation values of physical quantities with a 
given (generally optimized) trial wave function The procedure 

in effect amounts to evaluating a ratio of two integrals, although 

the actual Monte Carlo procedure is generally more sophisticated. 

2 

The second major category of QMC is the “exact" type. In these 
latter approaches the Schrodinger equation is actually "solved". It 
is not necessary to already have a highly accurate wave function in 
order to compute the expectation values. Properties of interest are 
in effect "measured" as the system evolves under the Schrodinger 
equation. When a stationary state is obtained, averages of the 
measured quantities give the desired expectation values. 

Only recently have chemical calculations by exact QMC methods 
3 4 

been carried out. ’ We will discuss here one such QMC method — 
the fixed-node, diffusion QMC — which we have been using in cal- 
culating molecular energies. In Section 2 we present the basic 
theory. Section 3 describes the algorithm. The implementation of 
this algorithm on the Cyber 205, its optimization, and results, are 
discussed in Section 4. 
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2. BASIC THEORY 

The Schrodinger equation may be rewritten in imaginary time, 
and with a constant shift in the zero of energy in the following form: 

3 ' t ’|' t> - [DV 2 - *( R) * E t ] f(R,t) . (1) 

Here D = fi / 2m g , R is the three-N dimensional coordinate vector 
of the N electrons, and V(R) is the potential energy (the Coulomb 
potential for a molecular system). Equation (1) is simply a 
diffusion equation combined with a first-order rate process, and thus 
may be readily simulated. The function ^(R.t) plays the role of the 
density of diffusing particles. These particles undergo branching 
(exponential birth or death processes) according to the rate term 
[Ey - V(R)] T(R) . Thus, the number of diffusers increases or 
decreases at a given point in proportion to the density of diffusers 
already there. 

The steady-state solution to Eq. (1) is the ground-state 
eigenfunction of the Schrodinger equation. Furthermore, the value of 
Ey at which the population of diffusers is asymptotically constant 
gives the energy eigenvalue Eg. The lowest eigenstate, however, is 
that of a Bose system. In order to treat a Fermi system, such as a 
molecule, we need to impose anti -symmetry on f(R). A method which 
does this, and at the same time allows us to sample more efficiently 
(to reduce our statistical error), is importance sampling with an 
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anti-symmetrized importance function 4^. The zeros (nodes) of ^ 
become absorbing boundaries for the diffusion process, maintaining 
the anti-symmetry. A simple form for which gives the necessary 
anti -symmetry is a Slater determinant of molecular orbitals 
multiplied by a symmetric function of the coordinates. 

To implement importance sampling, one simply multiplies Eq. (1) 
by and rewrites it in terms of a new probability density f(R,t) 
given by 


f(R>t) = Yj (R) T(R,t). 


( 2 ) 


The resultant equation for f can be written as 

!!■ = 0V 2 f + [E t - E L (R)]f - DV.[fF Q (R)] . (3) 

The local energy E^ (JR) and the "quantum force" Fg (JR ) are simple 
functions of 4j(R). Eq. (3), like Eq. (1), is a generalized 
diffusion equation, now with the addition of a drift term, due to the 
effect of Fg. It is Eq. (3) that we solve stochastically. Using a 
Green's function approach, our diffusers are made to follow a "random 
walk" (Markov chain) in such a way that their asymptotic distribution 
is given by the steady-state solution, f^R), of Eq. (3). Properties 
of interest (such as the energy) are measured during the "walks", and 
are thus averages over the distribution f^R). 
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3. ALGORITHM 


We present here an outline of the algoritnm for performing 
diffusion QMC. For more detail see Ref. 4. This algorithm is not 
structured specifically for the architecture of the Cyber 205. We 
will return to this point in the next section. 

(0) Initialization . First generate an ensemble of N c 
configurations of the N-electron system. Typically N c « 100-500. 
These coordinates may be chosen randomly, or more efficiently from 
the distribution I ( R ) I . This initial distribution is 

f ( R, t=0).° 

(1) Loop over blocks . In each block: 

(2) Repeatedly loop over the ensemble until the time in each 
configuration has reached the chosen target time. For each 
member of the ensemble compute the inverse of the Slater 
matrix. Then: 

(3) Loop over the electrons . Compute F^ for the current 
electron. Move to 

r = r + DxFq + x (4) 

where t is the discrete time-step size, and X is a 
3-dimensional Gaussian random variable with a mean of zero 
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and a variance of 2 Dt. This corresponds to the diffusive 
motion. If the electron crosses a node, eliminate the 
configuration from the ensemble and continue loop (2) over 
the ensemble. Otherwise update the Slater matrix and its 
inverse, and continue loop (3). 

After all electrons in the current configuration have been 
moved, advance the time associated with this new configuration 

I I 

R by t. Calculate E, (R ). Also calculate the branching 
factor, or multiplicity. 

M = exp (-t{[E l (R) + E l (R‘)]/ 2 - E t ». (5) 

Return M copies of this configuration to the ensemble. This 
branching, or birth and death process, corresponds to the rate 
term in Eq. (3). Weight all averages by M. Continue loop (2). 
After all members of the ensemble have reached the target time, the 
current block is finished. Use <E^> t0 update Ey. Store <E^> 
and the other averages. "Renormal ize" the ensemble back to its 
original size N . (This is necessary because the population grows 
or shrinks exponentially. Although we have endeavored to make the 
exponent close to zero [cf Eq. (5)3, asymptotically at large time tne 
population will either vanish or overflow the allocated storage.) 
Reset all averages to zero. Continue loop (1) for the desired number 
of blocks. 

(4) Average over blocks . 
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4. CYBER 205 IMPLEMENTATION. 


Tiie problem we chose to study is the singlet-triplet energy 
splitting of the methylene molecule, CHg is fairly typical 

of the molecules we have been studying by QMC, in terms of the number 
of electrons and the number of nuclei. As a result, most of the 
inner loops in this application are quite short. The longest inner 
loop runs over the set of atomic basis functions. With this in mind, 
we present our results on the relative performance of the Cyber 205 
and the VAX 11/780. To compare with the CDC 7600, we note that our 
code runs almost exactly ten times faster on the 7600 than on the VAX. 

We have implemented the QMC algorithm on the Cyber 205, initially 
by simply transcribing our working program from the VAX to the Cyber. 
The major impediment at this stage was the lack of unformatted 1/0 on 
the Cyber and, even worse, its inability to handle logical records 
longer than 137 bytes. After rewriting these portions of the code, 
the program finally ran. 

With automatic vectorization Doth on and off, the Cyber ran 
approximately 16 times the speed of the VAX. Apparently, any 
speed-up from vectorization of the longer loops was lost to the 
start-up time for vectorizing the short loops. It seemed clear that 
explicit vectorization was required. Thus, as our next step, all 
long inner loops of constant length were written explicitly in vector 
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syntax, while short constant-length loops were left as DO loops. 

Most loops in the code, however, are of variable length. These were 
all recoded in the form: 


IF (length .GT. C*) THEN 


[Vector code] 


ELSE 

[Scalar code] 

END IF. 

We present in Figure 1 our performance results as a function of 
the crossover length C*. At values of C* greater than 26 the scalar 
section of code is always being executed, and thus the curve flattens 
out. For C* less tnan approximately 16, it appears that vector 
start-up time hinders performance. The optimum crossover point 
appears to be around 16. The lowest of the three curves corresponds 
to the implementation described above. Subroutine calls are quite 
costly on tne Cyber 205. Thus in the middle curve we show the result 
of removing two short subroutines (both written in IF-THEN-ELSE 
form) and substituting vector code directly into the calling 
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programs. The speed-up is fairly dramatic, providing a peak speed of 
close to 20 times the VAX (up from 17). 

Interestingly, although the compliler recognizes that A**2 should 
be replaced by A*A, inside of vector code A**2 calls the float-to-an- 
integer-power routine. Needless to say, this is costly. Essentially, 
changing one line of vector code from A**2 to A*A led to the improve- 
ment shown in the top curve. Clearly the improvement is most 
pronounced for small C*, where this line of code is being executed 
more frequently. 

As mentioned earlier, the longest inner loop is over the number 

of atomic basis set functions, n. Traditional quantum chemistry 
4 5 

codes scale as n or n . Thus increasing the size of the basis 
set can be very costly. In our QMC approach, the algorithmic 
dependence on n is linear. In Fig. 2 we plot the relative run times 
as a function of basis set size on both the VAX (upper curve) and the 
Cyber 205 (lower curve). Both curves are indeed fairly linear in n. 
However, the slope for the Cyber is almost flat. This smaller slope 
is due to an increase in the vector length rather than an increase in 
the number of machine instructions being executed. The result is a 
speed enhancement of 30 over the VAX (up from 20) by n=50. 

Although a factor of 30 over the VAX (or equivalently a factor 
of 3 over the 7600) is certainly good, it is nowhere near our hoped 
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for performance. This can be explained by the fact that even loops 
of length 50 are relatively short on the Cyber 205. Possibly more 
important, however, is that the relatively long inner loops constitute 
only a fraction of the code being executed. Thus, truly high speed 
for this kind of application requires on architectural rewrite of the 
code. 


Looking over the algorithm (cf Sect. 3) it is clear that the 
entire structure is highly parallel. This is a fairly general 
characteristic of Monte Carlo codes. Thus, on a parallel processor 
the loop (1) over blocks can be eliminated, and each block can be 
computed independently on a separate processor. There is no communi- 
cation required between processors until the very end, when [step- (4)] 
the average over blocks is computed. 

For a truly efficient Cyber 205 algorithm, however, loop (1) is 
too short to vectorize, generally ranging between 10 and 100. Loop 
(2) is much more desirable to vectorize, with N c = 100-500. To do 
so, this loop must be made innermost in the new algorithm. In other 
words, the entire ensemble must be treated in parallel. Furthermore, 
the vector length is dynamic, since at each time-step the birth and 
death process modifies the ensemble size. We are currently develop- 
ing this fully vector code for future implementation. This code 
appears to have great potential for fully exploiting the vector 
capabilities of the 205. 
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Finally in Table 1, we present our results on the singlet-triplet 


energy splitting of methylene, and compare these results with theory 
and experiment. 
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TABLE 1. 


The ground-state (^Bi) and first-excited state ( ^ Ai ) energies of methylene. 


Method 

3 

B^-energy (hartrees) 

^-energy (hartrees) 

Hartree-Fock 

-38.9343 

-38.8944 

CI-SD 

-39.1071 

-39.0955 

CI-SDQ (est.) 

-39.122 

-39.105 

QMC 

-39.128*0.004 

-39.108*0.004 

Experimental 

-39.148 




1 A - 
A 1 

0 

energy (kcal/mole) 

Cl 

9.9-11.3 

Expt 

8.5-19.6 

QMC 

12.3*3.4 
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Figure 1. Relative speeds of the Cyber 205 and the VAX 11/780 for 
quantum Monte Carlo calculations of the ground-state energy of CH^ . 
The crossover point C* is the vector length below which variable- 
length loops are run in scaler mode . Thus , for large C* these loops 
are all run in scaler mode, whereas for very small C*, vector start- 
up time hinders performance. The three curves correspond to differ- 
ent degrees of hand-optimization of the code. See text for details. 
Note that the curves interpolating the data points are simply poly- 
nomial fits to the data. The actual curve for a particular molecule 
is a set of steps at the values of the various loop lengths that 
occur in the problem. The fits can be considered an "average" 
behavior for this type of calculation. 
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CPU TIME VS VECTOR LENGTH 


TIME (ARBITRARY UNITS) 


CYBER 205 


BASIS SET SIZE 


Figure 2. CPU time versus the number of atomic basis set functions, 
n. Conventional codes scale as n^ with \ as 4-6 while QMC scales 
simply as n. Both the VAX and Cyber show this n dependence clearly 
However, the slope for the Cyber is almost zero. At n=16 the Cyber 
is 20 times the speed of the VAX while at n=50 the Cyber is 30 
times faster. 




